From N-Grams to Collocations: An Evaluation of Xtract

Author

Frank Smadja

Abstract

In previous papers we presented methods for retrieving collocations from large samples of texts. We described a tool, Xtract, that implements these methods and is able to retrieve a wide range of collocations in a two-stage process. These methods, as well as other related methods, have some limitations. Mainly, the produced collocations do not include any kind of functional information, and many of them are invalid. In this paper we introduce methods that address these issues. These methods are implemented in an added third stage of Xtract that examines the set of collocations retrieved during the previous two stages to both filter out a number of invalid collocations and add useful syntactic information to the retained ones. By combining parsing and statistical techniques, the addition of this third stage has raised the overall precision level of Xtract from 40% to 80% with a recall of 94%. In the paper we describe the methods and the evaluation experiments.

1 INTRODUCTION

In the past, several approaches have been proposed to retrieve various types of collocations from the analysis of large samples of textual data. Pairwise associations (bigrams or 2-grams) (e.g., [Smadja, 1988], [Church and Hanks, 1989]) as well as n-word (n > 2) associations (or n-grams) (e.g., [Choueka et al., 1983], [Smadja and McKeown, 1990]) were retrieved. These techniques automatically produced large numbers of collocations along with statistical figures intended to reflect their relevance. However, none of these techniques provides functional information along with the collocation. Also, the results produced often contained improper word associations that reflected some spurious aspect of the training corpus and did not stand for true collocations. This paper addresses these two problems.

Previous papers (e.g., [Smadja and McKeown, 1990]) introduced a set of techniques and a tool, Xtract, that produces various types of collocations from a two-stage statistical analysis of large textual corpora, briefly sketched in the next section. In Sections 3 and 4, we show how robust parsing technology can be used both to filter out a number of invalid collocations and to add useful syntactic information to the retained ones. This filter/analyzer is implemented in a third stage of Xtract that automatically goes over the output collocations to reject the invalid ones and label the valid ones with syntactic information. For example, if the first two stages of Xtract produce the collocation "make-decision," the goal of this third stage is to identify it as a verb-object collocation. If no such syntactic relation is observed, then the collocation is rejected. In Section 5 we present an evaluation of Xtract as a collocation retrieval system. The addition of the third stage has been evaluated to raise the precision of Xtract from 40% to 80%, and it has a recall of 94%. In this paper we use examples related to the word "takeover" from a 10 million word corpus containing stock market reports originating from the Associated Press newswire.

2 FIRST 2 STAGES OF XTRACT, PRODUCING N-GRAMS

In a first stage, Xtract uses statistical techniques to retrieve pairs of words (or bigrams) whose common appearances within a single sentence are correlated in the corpus. A bigram is retrieved if its frequency of occurrence is above a certain threshold and if the words are used in relatively rigid ways. Some bigrams produced by the first stage of Xtract are given in Table 1: the bigrams all contain the word "takeover" and an adjective. In the table, the distance parameter indicates the usual distance between the two words. For example, distance = 1 indicates that the two words are frequently adjacent in the corpus.
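
The first-stage criteria described above (a frequency threshold plus relatively rigid use) can be illustrated with a minimal sketch. This is not Xtract's actual statistical test: it assumes that "rigid" can be approximated by the share of co-occurrences falling at the single most common distance, and the function name `retrieve_bigrams`, its parameters, the sign convention for distance, and the toy sentences are all hypothetical.

```python
from collections import Counter, defaultdict

def retrieve_bigrams(sentences, target, min_freq=2, min_peak_share=0.6, window=5):
    """Return {collocate: (frequency, usual_distance)} for `target`.

    Simplified stand-in for stage one (an assumption, not Xtract's test):
    a pair is kept when it co-occurs at least `min_freq` times and one
    relative distance accounts for at least `min_peak_share` of those
    co-occurrences.
    """
    distances = defaultdict(Counter)  # collocate -> counts per distance
    for tokens in sentences:
        for i, word in enumerate(tokens):
            if word != target:
                continue
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    # Assumed convention: distance 1 means the collocate
                    # immediately precedes the target word.
                    distances[tokens[j]][i - j] += 1
    kept = {}
    for collocate, counts in distances.items():
        freq = sum(counts.values())
        usual_distance, peak = counts.most_common(1)[0]
        if freq >= min_freq and peak / freq >= min_peak_share:
            kept[collocate] = (freq, usual_distance)
    return kept

# Toy corpus, invented for illustration only.
toy_sentences = [
    "the hostile takeover bid failed".split(),
    "a hostile takeover was announced".split(),
    "investors feared a hostile takeover".split(),
    "the takeover was hostile to management".split(),
]
print(retrieve_bigrams(toy_sentences, "takeover", min_freq=3))
# {'hostile': (4, 1)}
```

In this toy run "hostile" co-occurs with "takeover" four times, three of them at distance 1, so it passes both checks; the other words are too infrequent to be kept.
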
In a second stage, Xtract uses the output bigrams to produce collocations involving more than two words (or n-grams). It examines all the sentences containing the bigram and analyzes the statistical distribution of words and parts of speech for each position around the pair. It retains words (or parts of speech) occupying a position with probability greater than a given
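
The positional analysis of the second stage can be sketched as follows. This is a rough illustration only: it considers words but not parts of speech, and since the source text breaks off before stating the threshold, the value used here is an assumption. The name `expand_bigram`, its parameters, and the toy sentences are hypothetical.

```python
from collections import Counter, defaultdict

def expand_bigram(sentences, first, second, distance=1, threshold=0.7, span=2):
    """Return {relative_position: word} for positions dominated by one word.

    Positions are relative to `first`. A word is kept at a position when it
    fills that position in more than `threshold` of the bigram's occurrences
    (the threshold value is an assumption).
    """
    by_position = defaultdict(Counter)  # relative position -> word counts
    matches = 0
    for tokens in sentences:
        for i in range(len(tokens) - distance):
            if tokens[i] == first and tokens[i + distance] == second:
                matches += 1
                for rel in range(-span, distance + span + 1):
                    pos = i + rel
                    if 0 <= pos < len(tokens):
                        by_position[rel][tokens[pos]] += 1
    pattern = {}
    for rel, counts in sorted(by_position.items()):
        word, count = counts.most_common(1)[0]
        if count / matches > threshold:
            pattern[rel] = word
    return pattern

# Toy corpus, invented for illustration only.
toy_corpus = [
    "the company fended off a hostile takeover attempt by investors".split(),
    "analysts expect a hostile takeover attempt soon".split(),
    "it was a hostile takeover attempt by the group".split(),
    "rumors of a hostile takeover bid lifted shares".split(),
]
print(expand_bigram(toy_corpus, "hostile", "takeover"))
# {-1: 'a', 0: 'hostile', 1: 'takeover', 2: 'attempt'}
```

Here the bigram "hostile takeover" grows into "a hostile takeover attempt" because "a" and "attempt" each fill their position in more than 70% of the matching sentences, while the other positions are too varied to retain a word.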

Similar resources

Using Synonym Relations in Chinese Collocation Extraction

A challenging task in Chinese collocation extraction is to improve both the precision and recall rate. Most lexical statistical methods, including Xtract, face the problem of being unable to extract collocations with frequencies lower than a given threshold. This paper presents a method where HowNet is used to find synonyms using a similarity function. Based on such synonym information, we have success...

INFO256 Project Report Implementation and Evaluation of Xtract in WordSeer

Natural languages are full of word collocations that frequently co-occur and correspond to arbitrary word usages. They appear in both technical and non-technical textual corpora and often have specific significance in individual contexts. Accurately retrieving and identifying collocations from a given corpus in an unsupervised manner is imperative to understanding and automatically generating t...

Retrieving Collocations from Text: Xtract

Natural languages are full of collocations, recurrent combinations of words that co-occur more often than expected by chance and that correspond to arbitrary word usages. Recent work in lexicography indicates that collocations are pervasive in English; apparently, they are common in all types of writing, including both technical and nontechnical genres. Several approaches have been proposed to ...

Automatically Extracting and Representing Collocations for Language Generation

Collocational knowledge is necessary for language generation. The problem is that collocations come in a large variety of forms. They can involve two, three or more words, these words can be of different syntactic categories and they can be involved in more or less rigid ways. This leads to two main difficulties: collocational knowledge has to be acquired and it must be represented flexibly so ...

The Identification and Classification of Unknown Words in Chinese An N-Grams-Based Approach

In this paper, we propose a new approach to identify unknown words in Chinese. This approach adopts an n-grams program to sort out the collocating word/character sequences which are possible words and phrases in Chinese. In addition to proposing the criteria for identifying Chinese new words, we also classify these new words according to their structural and semantic characteristics. The cor...

Publication date: 1991
